Juan Fernández Afonso

Introduction

The goal of this project is to provide insight into the key properties that influence the quality of red wine. To do so, we will analyze the information contained in the ./wineQualityReds.csv dataset. This is a tidy dataset containing 1599 samples, each described by 13 fields. We will study the distributions of these features and their correlations, paying special attention to how they influence the score/quality of the wine.

This project is organized as follows: in Sec. 1 the dataset is described together with some general descriptive statistics. In Secs. 2-4 we carry out univariate, bivariate and multivariate analyses. Finally, we conclude by summarising the main results of this report and suggesting possible ways to improve the quality of this analysis.

Section 1: Dataset description

Usually, the first step in EDA is to understand the structure of the data. As mentioned in the introduction, the original dataset is composed of 1599 distinct samples of red wine. For each one, 13 features were recorded. The structure of these fields is displayed in the panel below. With the exception of “X” (the sample id number) and the wine rating (“quality” and the “score” derived from it), the features are numerical variables describing the chemical composition. Acid-related information appears to be very important, since it is captured in four different fields: fixed acidity, volatile acidity, citric acid and pH. In addition to the original 13 features, two new fields were computed: the total acidity, obtained by adding up the fixed and volatile acidity, and the score, obtained by grouping the wines into five categories according to their quality. The score of a given sample is “very bad”, “bad”, “average”, “good” or “excellent”.

## 'data.frame':    1599 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ score               : Ord.factor w/ 5 levels "very bad"<"bad"<..: 3 3 3 3 3 3 3 4 4 3 ...
##  $ total.acidity       : num  8.1 8.68 8.56 11.48 8.1 ...
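The two derived fields can be reproduced with a few lines of base R. The sketch below is an assumption about how they were built, not the original code: it uses only the first ten rows from the listing above, and the quality break points (0, 2, 4, 6, 8, 10) are inferred from the score codes shown in that listing.

```r
# First ten rows, copied from the structure listing above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Total acidity: sum of the fixed and volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Score: ordered factor with five levels.
# ASSUMPTION: the break points are inferred, not taken from the original code.
wine$score <- cut(
  wine$quality,
  breaks = c(0, 2, 4, 6, 8, 10),
  labels = c("very bad", "bad", "average", "good", "excellent"),
  ordered_result = TRUE
)

str(wine)
```

On these ten rows the result agrees with the listing: the first total acidity is 8.1, and a quality of 7 maps to the fourth level, “good”.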

The minimum, quartiles, median, mean and maximum of each field are displayed in the table below. By comparing the mean and the median we can obtain a general idea of the skewness of a given distribution. Typically, a positively (negatively) skewed distribution is characterized by a long right (left) tail, usually populated by outliers. The skewness can be estimated (in units of the standard deviation) from the mean-median difference: \[\text{skewness} \propto \frac{\mu - \nu}{\sigma},\] where \(\mu\), \(\nu\) and \(\sigma\) are the mean, median and standard deviation of the feature distribution.

## [1] "X"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   400.5   800.0   800.0  1199.5  1599.0 
## 
## [1] "fixed.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90 
## 
## [1] "volatile.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800 
## 
## [1] "citric.acid"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000 
## 
## [1] "residual.sugar"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500 
## 
## [1] "chlorides"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100 
## 
## [1] "free.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00 
## 
## [1] "total.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00 
## 
## [1] "density"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037 
## 
## [1] "pH"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010 
## 
## [1] "sulphates"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000 
## 
## [1] "alcohol"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90 
## 
## [1] "quality"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000 
## 
## [1] "score"
##  very bad       bad   average      good excellent 
##         0        63      1319       217         0 
## 
## [1] "total.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.285
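The mean-median skewness measure can be sketched as a small helper. The function name median_skew is hypothetical, and the two vectors are just the first ten values of residual sugar and pH from the structure listing, so the numbers are illustrative rather than the values for the full dataset.

```r
# Mean-median skewness in units of the standard deviation.
# ASSUMPTION: helper name and sample vectors are for illustration only.
median_skew <- function(x) (mean(x) - median(x)) / sd(x)

residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
ph             <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

skew_sugar <- median_skew(residual_sugar)  # positive: the mean lies above the median
skew_ph    <- median_skew(ph)              # close to zero: mean and median coincide

print(c(residual.sugar = skew_sugar, pH = skew_ph))
```

With the full data frame loaded under an assumed name such as wine, the same helper could be applied to every numeric field via sapply(wine[sapply(wine, is.numeric)], median_skew).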

The quality is the only field with a clearly negative skew: its mean (5.636) lies below its median (6). Some fields, such as “pH” or “density”, have almost zero skewness. As we will see in the next section, this is explained by the Gaussian-like shape of their distributions.

It is worth mentioning that this dataset contains only bad, average and good wines, with population shares of 3.9%, 82.5% and 13.6% respectively. The data is therefore not uniformly spread over the score/quality spectrum, which may bias the results. We thus expect a peaked distribution for the “quality” field, centred at 5-6 and with a small \(\sigma\). The different population distributions are shown and described in Secs. 2 and 3.
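These shares follow directly from the score counts in the summary table. A minimal check, with the counts typed in by hand rather than read from the CSV file:

```r
# Score counts copied from the summary table in this section.
score_counts <- c("very bad" = 0, bad = 63, average = 1319, good = 217, excellent = 0)

# Share of each category in percent, rounded to one decimal place.
score_share <- round(100 * score_counts / sum(score_counts), 1)
print(score_share)
```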

Section 2: Univariate analysis

In this section, we explore the distributions of the dataset features. I have chosen different bin widths according to the range of each variable.
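A minimal base R sketch of choosing an explicit bin width is given below. The original figures may well have been produced with another plotting package. The 0.5 g/dm^3 width and the 4-16 range are assumptions chosen to cover the fixed acidity range (4.6 to 15.9), and the data are only the first ten values from the listing in Sec. 1.

```r
# First ten fixed acidity values from the structure listing.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# ASSUMPTION: bin width and range are illustrative choices.
bin_width <- 0.5
breaks <- seq(4, 16, by = bin_width)

# Bin the data without drawing; drop plot = FALSE to see the histogram.
h <- hist(fixed_acidity, breaks = breaks, plot = FALSE)

# Midpoint of the most populated bin.
busiest_bin <- h$mids[which.max(h$counts)]
print(busiest_bin)
```

A narrower width resolves more structure (for instance a double peak) at the cost of noisier counts, which is why a single width rarely suits all the variables.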

Fixed acidity \((g/dm^3)\)

The fixed acidity contains information about non-volatile acids. As we can see in the figure above, the fixed acidity has a positively skewed distribution, with a median of 7.90 and a mean of 8.32.

Volatile acidity \((g/dm^3)\)

This feature is associated with acetic acid which, in high concentrations, typically leads to an unpleasant vinegar-like taste that negatively affects the wine quality. We therefore expect the sample score to drop as this quantity increases. With proper tuning of the bin width and of the x-axis scaling (right panel, in logarithmic scale) we can identify a bimodal distribution with two peaks located around 0.4 and 0.6.

Citric acid \((g/dm^3)\)

This feature describes the concentration of citric acid found in the samples. It usually gives the wine a fresh flavour. Because of this, we expect a positive correlation with the quality. One clear outlier was found around 1.0. The long tail to the right is the footprint of a positively skewed distribution; however, the overall shape is far from Gaussian.

pH

pH is a parameter often used in chemistry to measure the acidity of a given solution. It takes values from 0 (very acidic) to 14 (very basic). The pH distribution is centred at 3.3. The Gaussian-like shape explains the almost zero skewness. Most of the samples have a pH between 3.0 and 3.6.

Total acidity \((g/dm^3)\)

This feature was calculated by adding up the fixed and volatile acid concentrations. The resulting distribution is positively skewed with outliers present above 12.6. This behaviour was expected considering the skewness of the fixed and volatile distributions. We see that the fixed acidity dominates the histogram since the double peak feature of the volatile acidity is no longer present.

Residual sugar \((g/dm^3)\)

This feature contains information about the sugar remaining in the sample after fermentation. The distribution of this variable is formed by a narrow peak located at 2.0 with a long tail extending towards higher concentrations. The outliers are spread over a wide interval, roughly from 4.0 to 16.0. The minimum sugar concentration is 0.9.

Chlorides \((g/dm^3)\)

The chlorides behave similarly to the residual sugar, with a narrow Gaussian-like peak followed by a long tail. The peak is located around 0.075.

Sulphates \((g/dm^3)\)

The sulphates also have a positively skewed, Gaussian-like distribution, similar in shape to the previous two features, with a median of 0.6200 and a mean of 0.6581.

Total sulfur dioxide \((mg/dm^3)\)

Sulfur dioxide is often found dissolved in wine. This field contains its total amount (free or bound). The histogram looks like a (quasi-exponential) decay starting around 6, with a tail extending up to around 150. Two outliers are present at 278 and 289.

Free sulfur dioxide \((mg/dm^3)\)

The free sulfur dioxide histogram is also positively skewed. The distribution has a long tail decaying up to around 65, with a median of 14.00 and a mean of 15.87.

Density \((g/cm^3)\)

The density is the ratio between the mass and the volume of a given sample. The histogram looks like a Gaussian distribution centred around 0.997.

Alcohol (% by volume)

The alcohol has a rather irregular histogram, with a peak-like feature centred around 9 and a tail extending towards higher values. This is a positively skewed distribution, with a median of 10.20 and a mean of 10.42.

Sample rating

The above figure illustrates the distribution of the sample scores, with the quality shown as colour. As mentioned before, the selected wines are not uniform in quality (some ratings are missing altogether). This lack of data could affect the related analysis. As expected, the distribution looks like a narrow Gaussian centred at 5.636.

Section 3: Bivariate analysis

In this section, we will analyze the dependence between two given features. To this end, it is useful to compute the correlation matrix since it provides good insight about linear correlations.

Notice that the blank spaces correspond to pairs of features with almost zero correlation (\(|\text{corr}|<0.1\)). As expected, there is a non-negligible correlation between the acid-related features: pH, citric acid, fixed, volatile and total acidity. These features also show some correlation with the wine density. To my surprise, the quality does not strongly correlate with any other feature in the dataset except alcohol, which could indicate that this dataset lacks the data needed to obtain reliable information about quality-related variables. Another interesting observation is that the total and free sulfur dioxide seem to be isolated from the rest of the data fields. We will analyse all of these observations throughout this section.
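A correlation matrix with weak entries blanked out can be sketched in base R as follows. This is an illustration under assumptions, not the code behind the figure: it uses Pearson correlations on the first ten rows from the listing in Sec. 1 and a subset of the fields, so the coefficients differ from those of the full dataset.

```r
# First ten rows of a few fields, copied from the structure listing.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns.
corr <- cor(wine)

# Blank out entries whose absolute value is below 0.1.
corr_shown <- round(corr, 2)
corr_shown[abs(corr) < 0.1] <- NA

print(corr_shown, na.print = "")
```

Thresholding on the absolute value matters here: a strongly negative coefficient is as informative as a strongly positive one and should not be blanked.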

Sulfur dioxide concentration

The following plot shows the dependence between the free and total sulfur dioxide. A linear regression yields \(R^2 = 0.45\). This value is far from 1 (a perfectly linear relation) because the dispersion of the data grows with the total sulfur dioxide concentration. The positive dependence can be explained by a positive relation between the concentration of a gas in a liquid and the rate at which the gas escapes from the solution. However, this does not explain the growing dispersion, which may indicate a more complex chemical relation. This behaviour could also depend on the type of red wines analyzed, since some of them may have a larger absorption coefficient than others. In Sec. 4, we will further explore how this sulfur dioxide dependence relates to the wine quality.
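The regression itself can be sketched with lm. The direction of the fit (free regressed on total) is an assumption, and the ten rows from the listing in Sec. 1 will not reproduce the \(R^2\) quoted for the full dataset; the sketch only shows where that number comes from.

```r
# First ten rows of both fields, copied from the structure listing.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# ASSUMPTION: free sulfur dioxide is modelled as a linear function of the total.
fit <- lm(free.sulfur.dioxide ~ total.sulfur.dioxide, data = so2)

# Coefficient of determination and fitted slope.
r_squared <- summary(fit)$r.squared
slope     <- unname(coef(fit)["total.sulfur.dioxide"])

print(c(r_squared = r_squared, slope = slope))
```

For a regression with a single predictor, \(R^2\) equals the squared Pearson correlation, so it does not depend on which of the two fields is taken as the response.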

Chemical features vs. wine quality

As we noticed in the correlation matrix figure, the quality does not strongly depend on any other feature. However, this coefficient only captures linear correlations, leaving room for possible hidden non-linear relations. In this subsection, we will study how the chemical properties vary with the wine quality.

As we can see in the above figure, the acid-related features behave differently with respect to the wine scores. Volatile acidity and citric acid are the variables that depend most strongly on them. The former decreases rapidly for better wines, while the citric acid is positively correlated with the scores. This behaviour agrees with the aforementioned link between volatile acidity and an unpleasant vinegar-like taste. The pH decreases slightly. The small slope of this field could be explained by the different acid-related features compensating each other. Finally, the fixed acidity does not show a uniform tendency. It presents a slight decay for scores of 3-4 and rises again from 5-6 but, due to the lack of statistics for extreme wine scores, this observation is not completely reliable.

The above panel shows the non-acid features vs. quality. We see that the alcohol is nearly flat for scores from 3-5 and rises for medium-high quality wines. The sulphates also present a roughly monotonic positive slope. In contrast, a higher density goes with poorer wine quality. Finally, the total sulfur dioxide reaches its maximum for average wines and its minima at the extreme scores (3 and 8).

In general terms, we can conclude that the wine quality depends positively on citric acid, alcohol and sulphates, and negatively on volatile acidity, pH and density. On the other hand, its relation with the other quantities studied is still unclear.
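One way to put numbers on trends like these is to summarise each feature per quality level. The sketch below uses the median as the summary, which is an assumption (the figures may show a different statistic), and again relies only on the first ten rows from the listing in Sec. 1.

```r
# First ten rows of the relevant fields, copied from the structure listing.
wine <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Median of each feature within every quality level.
median_by_quality <- aggregate(
  cbind(volatile.acidity, alcohol) ~ quality,
  data = wine,
  FUN  = median
)

print(median_by_quality)
```

With so few rows per level the medians are only indicative; on the full dataset the same call yields one row per quality value from 3 to 8.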

Feature density estimations and quality

We can test the conclusions from the previous subsection by studying the estimated densities of different features for distinct wine scores. The following panels show the per-score density estimates of volatile acidity, pH, citric acid, alcohol, density and sulphates.

In general terms, the density estimates agree with the previous results. However, some of the population profiles are too narrow to support a clear statement. It is worth mentioning that the volatile acidity, citric acid, alcohol and sulphates agree with the conclusions of the previous subsection. From their density estimates we observe that the distributions of the good wines are clearly shifted with respect to those of the average and bad ones.
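Per-score density estimates of this kind can be sketched with the base R density function. The default Gaussian kernel and bandwidth are assumptions, as is the simplified two-group split; the ten rows from the listing in Sec. 1 contain only average and good wines.

```r
# First ten rows of alcohol and quality, copied from the structure listing.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# ASSUMPTION: simplified grouping, 7 and above is "good", the rest "average".
group <- ifelse(quality >= 7, "good", "average")

# One kernel density estimate per group (default Gaussian kernel).
dens <- lapply(split(alcohol, group), density)

# Sanity check: each estimate should integrate to roughly one.
area <- sapply(dens, function(d) sum(d$y) * diff(d$x[1:2]))
print(area)
```

Each element of dens can be drawn with plot or lines to overlay the groups. A group with very few samples gives a narrow, unreliable curve, which is the limitation noted above.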

Section 4: Multivariate analysis

In this section, we analyze the relationship between multiple variables. In all of the plots, the wine quality is displayed as colour. We will explore the free vs. total sulfur dioxide concentration as well as some quality-related features derived from the previous sections.

Total vs free sulfur dioxide

As we can see, adding the quality layer to the sulfur dioxide plot does not provide any new information. One way to address this is to increase the amount of data for the good wines. A larger population could reveal a clearer relation between these fields.

Final plots and summary

We have studied how different chemical features of red wine correlate with each other, with the goal of understanding which properties make a given wine better than others. The main results of this report are summarized in the following figures.

Figure 1

It was found that the volatile acidity is one of the features that most influences the wine quality. There is a negative linear correlation of -0.39, which could hide a non-linear relation between these fields. In general terms, we have observed that a higher volatile acidity lowers the wine quality. As we can see from the figure above, the average wines dominate the sample, accounting for 82.5% of it. The scores look normally distributed, but there is no data available for the extreme quality values 1-2 and 9-10.

Figure 2

As shown in this figure, the alcohol also plays an important role in wine quality. For the samples analysed, we found a big difference between the estimated population densities for good and average-bad wines. This, together with the positive correlation between alcohol and score, makes clear that for this dataset an increase in alcohol concentration has a positive impact on the wine quality.

Figure 3

Finally, we have observed that the combination of volatile acidity and citric acid provides a phase diagram from which we can identify a cluster containing the good wines. These two parameters together with the alcohol and sulphates concentrations provide a suitable set of variables to select a wine with good quality.

Reflection

As mentioned before, the main problem of the analyzed dataset is the lack of data for non-average wines. This could hide interesting correlations between features or alter the aforementioned generalizations for these ratings.

However, it was possible to address the main goal of this project concerning the quality-related features in both the bivariate and the multivariate analysis. We also studied the correlations between other features, such as the sulfur dioxide or acid concentrations.

The results presented in this report could be improved by increasing the number of bad and good wines in the sample. It would also be interesting to include extreme cases, such as very bad and excellent wines, to test the strength of the conclusions. Machine learning algorithms could be employed to treat this wine quality study as a classification problem. To this end, a more balanced dataset would be key to avoiding biased results. Another direction for a future project could be to study the white and red wines together, to see whether the conclusions derived from this report also apply to white wines.